Finding meaningful patterns within data has become increasingly difficult as data collection and management continue to grow at an unprecedented rate.
K-means clustering is well suited to such large volumes of unlabeled data.
We will employ the k-means clustering algorithm to gain insight into customer segmentation in e-commerce data from a retail store based in the UK.
By the end of this presentation we will have discussed the following concepts of k-means clustering:
There are 5 main steps to execute the k-means clustering method.
Clustering is the act of partitioning data into meaningful groups based on similarity in attributes.
The goal of clustering is to create insightful clusters to better understand connections in the data.
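\[ d(x, \mu_i) = \sqrt{\sum_{j=1}^{N} (x_j - \mu_{ij})^2} \]
Here \(x_j\) is the \(j\)-th of the \(N\) attributes of point \(x\), and \(\mu_{ij}\) is the \(j\)-th coordinate of the centroid \(\mu_i\).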
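As a rough illustration of the distance formula, the following R sketch computes the Euclidean distance between one point and one centroid by hand and compares it with base R's `dist()`. The point `x`, the centroid `mu`, and the helper `euclidean_distance()` are made-up names and values for illustration; they are not taken from the retail data.

```r
# Hypothetical helper: Euclidean distance between a point and a centroid
euclidean_distance <- function(x, mu) {
  sqrt(sum((x - mu)^2))
}

# Made-up example with N = 2 attributes
x  <- c(3, 4)
mu <- c(0, 0)

d_manual <- euclidean_distance(x, mu)

# Base R's dist() uses the Euclidean metric by default
d_base <- as.numeric(dist(rbind(x, mu)))

print(d_manual)  # 5
print(d_base)    # 5
```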
Objective Function:
K-means minimizes the within-cluster sum of squares (WCSS), formulated as:
\[ \mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \]
\(k\) is the number of clusters.
\(C_i\) represents the set of points in cluster \(i\).
\(\mu_i\) represents the centroid (mean) of cluster \(i\).
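The objective can be checked numerically. The sketch below, which uses simulated data rather than the retail data, computes the within-cluster sum of squares by hand and compares it with the `tot.withinss` value returned by base R's `kmeans()`. The names `toy`, `fit`, `wcss_manual`, and `wcss_by_k` are assumptions made for this example; in particular, the loop over K is only one common way to build the values an elbow plot needs.

```r
set.seed(42)  # reproducible made-up data

# Two well-separated blobs of 50 points each, 2 attributes per point
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

fit <- kmeans(toy, centers = 2, nstart = 10)

# WCSS by hand: squared distance of each point to its own cluster centroid
assigned_centroids <- fit$centers[fit$cluster, , drop = FALSE]
wcss_manual <- sum((toy - assigned_centroids)^2)

# kmeans() reports the same quantity as tot.withinss
print(c(manual = wcss_manual, kmeans = fit$tot.withinss))

# Repeating the fit for K = 1 to 10 gives one WCSS value per K
wcss_by_k <- sapply(1:10, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})
print(round(wcss_by_k, 1))
```

Because the simulated blobs are far apart, the WCSS drops sharply from K = 1 to K = 2 and changes much less afterwards.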
In this context, similarity is inversely related to the Euclidean distance
The smaller the distance, the greater the similarity between objects
K-means clustering reassigns each data point to the cluster with the nearest centroid, based on the Euclidean distance calculation.
Each centroid is then moved to the mean of the points assigned to its cluster.
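The assignment and update steps can be sketched as a single iteration in R. This is a minimal illustration on six made-up points, not the implementation used for the retail data; `pts`, `centroids`, and the helper `kmeans_step()` are hypothetical names, and the sketch assumes no cluster ends up empty.

```r
# Made-up data: six points with two attributes each
pts <- matrix(c(1, 1,
                1, 2,
                2, 1,
                8, 8,
                8, 9,
                9, 8), ncol = 2, byrow = TRUE)

# Hypothetical starting centroids (k = 2)
centroids <- matrix(c(0, 0,
                      10, 10), ncol = 2, byrow = TRUE)

# Hypothetical helper: one assignment step followed by one update step
kmeans_step <- function(pts, centroids) {
  k <- nrow(centroids)
  # Squared Euclidean distance from every point to every centroid (n x k)
  d2 <- sapply(seq_len(k), function(i) {
    rowSums(sweep(pts, 2, centroids[i, ])^2)
  })
  # Assignment: each point joins the cluster of its nearest centroid
  cluster <- max.col(-d2, ties.method = "first")
  # Update: each centroid moves to the mean of its assigned points
  # (assumes no cluster ends up empty)
  new_centroids <- t(sapply(seq_len(k), function(i) {
    colMeans(pts[cluster == i, , drop = FALSE])
  }))
  list(cluster = cluster, centroids = new_centroids)
}

step <- kmeans_step(pts, centroids)
print(step$cluster)    # 1 1 1 2 2 2
print(step$centroids)  # updated centroid positions, one row per cluster
```

Repeating `kmeans_step()` with the returned centroids until the assignments stop changing would give the full iterative procedure.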
library(ggplot2)
# Plot the WCSS values against the number of clusters
# (`wcss` is assumed to hold the WCSS for K = 1 to 10, computed beforehand)
p1 <- ggplot(data.frame(K = 1:10, WCSS = wcss), aes(x = K, y = WCSS)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Method to Find Optimal K", x = "Number of Clusters (K)", y = "Within-Cluster-Sum-of-Squares (WCSS)") +
  scale_x_continuous(breaks = seq(0, 10, by = 1))

Cluster Results
Since the goal is to evaluate the clusters and find meaning in the results, we can connect purchasing frequency and amount to marketing and pricing strategies that promote customer satisfaction.
cluster 1: potentially newer customers, large target for marketing.
cluster 2: target for marketing a select range of costlier items.
cluster 3: likely frequent customers, may benefit from low to mid priced item recommendations to increase or maintain engagement.
cluster 4: lowest return on investment and fewest interactions, so the lowest priority.
Over the past sixty years, improvements and additions have been made to the base k-means methodology in order to optimize performance and increase scalability.
This data may benefit from exploring alternative distance formulas, such as Manhattan distance and Chebyshev distance.
Additional \(k\) selection methods, such as the empirical and silhouette methods, and principal component analysis or other dimensionality reduction techniques may also improve efficiency and overall clustering results.
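To give a feel for how the alternative distance formulas differ, the sketch below compares Euclidean, Manhattan, and Chebyshev distances for one pair of made-up points using base R's `dist()`, where Chebyshev distance is available as `method = "maximum"`. The points `a` and `b` are invented for illustration.

```r
# Two made-up customers described by two attributes
a <- c(1, 2)
b <- c(4, 6)
pair <- rbind(a, b)

euclidean <- as.numeric(dist(pair, method = "euclidean"))  # sqrt(3^2 + 4^2) = 5
manhattan <- as.numeric(dist(pair, method = "manhattan"))  # |3| + |4| = 7
chebyshev <- as.numeric(dist(pair, method = "maximum"))    # max(|3|, |4|) = 4

print(c(euclidean = euclidean, manhattan = manhattan, chebyshev = chebyshev))
```

Note that base R's `kmeans()` works with Euclidean distance only, so clustering under another metric would require a different routine.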